AITopics | delayed reward

Collaborating Authors

delayed reward

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

RUDDER: Return Decomposition for Delayed Rewards

anonymous

Neural Information Processing SystemsFeb-11-2026, 13:56:14 GMT

reinforcement learning; delayed reward; reward redistribution; return decomposition; bias-variance; credit assignment; LSTM

infinitesimal change, reward redistribution, rudder, (13 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

Add feedback

RUDDER: Return Decomposition for Delayed Rewards

Neural Information Processing SystemsDec-25-2025, 01:42:32 GMT

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero.

name change, return decomposition, rudder, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards

Neural Information Processing SystemsDec-23-2025, 18:34:33 GMT

Incrementality, which measures the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most $\widetilde{O}(H^2\sqrt{T})$, which depends on the number of rounds $H$ and number of episodes $T$, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem from standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.

incrementality bidding, name change, reinforcement learning, (8 more...)

Neural Information Processing Systems

Industry: Education (0.60)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

RUDDER: Return Decomposition for Delayed Rewards

anonymous

Neural Information Processing SystemsOct-2-2025, 05:30:42 GMT

reinforcement learning; delayed reward; reward redistribution; return decomposition; bias-variance; credit assignment; LSTM

artificial intelligence, machine learning, reward redistribution, (16 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games (0.35)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.61)

Add feedback

Reviews: RUDDER: Return Decomposition for Delayed Rewards

Neural Information Processing SystemsJan-21-2025, 22:34:08 GMT

The reward redistribution method is proven to preserve optimal policies and reduce the expected future reward to zero. This is achieved by redistributing the delayed rewards to the salient state-action events (where saliency is determined by contribution analysis methods). Extensive experiments in both toy domains, as well as the suite of Atari games, demonstrate the method's improvements for delayed reward tasks, as well as the shortcomings of MC and TD methods for these types of tasks. Comments: I felt the work presented in the paper is outstanding. There are numerous contributions that could conceivably stand on their own (resulting in an extremely large appendix!).

experiment, return decomposition, rudder, (8 more...)

Neural Information Processing Systems

Industry: Leisure & Entertainment > Games > Computer Games (0.59)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.56)

Add feedback

Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

Tang, Yuting, Cai, Xin-Qiang, Pang, Jing-Cheng, Wu, Qiyu, Ding, Yao-Xiang, Sugiyama, Masashi

arXiv.org Artificial IntelligenceOct-26-2024

Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent's performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating the strong assumption. We suggest that the delayed reward may arise from a more complex structure reflecting the overall contribution of the sequence. To address this problem, we present a framework for modeling composite delayed rewards, using a weighted sum of non-Markovian components to capture the different contributions of individual steps. Building on this framework, we propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism to effectively model these contributions. We conduct experiments on challenging locomotion tasks where the agent receives delayed rewards computed from composite functions of observable step rewards. The experimental results indicate that CoDeTr consistently outperforms baseline methods across evaluated metrics. Additionally, we demonstrate that it effectively identifies the most significant time steps within the sequence and accurately predicts rewards that closely reflect the environment feedback.

artificial intelligence, machine learning, reinforcement learning, (17 more...)

arXiv.org Artificial Intelligence

2410.20176

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Asia > China > Jiangsu Province > Nanjing (0.04)

Genre: Research Report > New Finding (0.68)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

Incrementality Bidding via Reinforcement Learning under Mixed and Delayed Rewards

Neural Information Processing SystemsOct-9-2024, 16:55:40 GMT

Incrementality, which measures the causal effect of showing an ad to a potential customer (e.g. a user in an internet platform) versus not, is a central object for advertisers in online advertising platforms. This paper investigates the problem of how an advertiser can learn to optimize the bidding sequence in an online manner \emph{without} knowing the incrementality parameters in advance. We formulate the offline version of this problem as a specially structured episodic Markov Decision Process (MDP) and then, for its online learning counterpart, propose a novel reinforcement learning (RL) algorithm with regret at most \widetilde{O}(H 2\sqrt{T}), which depends on the number of rounds H and number of episodes T, but does not depend on the number of actions (i.e., possible bids). A fundamental difference between our learning problem from standard RL problems is that the realized reward feedback from conversion incrementality is \emph{mixed} and \emph{delayed}. To handle this difficulty we propose and analyze a novel pairwise moment-matching algorithm to learn the conversion incrementality, which we believe is of independent interest.

delayed reward, incrementality bidding, reinforcement learning, (5 more...)

Neural Information Processing Systems

Industry: Education (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback